
refactor: add parallelization optimization to bpcg #7416

Open
Missing-Hex wants to merge 3 commits into deepmodeling:develop from Missing-Hex:refactor/bpcg


@Missing-Hex

OpenMP Parallelization for BPCG CPU Kernels

Summary

This PR implements OpenMP parallelization for the two hotspot functions in the BPCG (Block Preconditioned Conjugate Gradient) diagonalization algorithm:

  • line_minimize_with_block_op<CPU>
  • calc_grad_with_block_op<CPU>

Motivation

The BPCG algorithm is an iterative diagonalization method used in ABACUS for solving the Kohn-Sham equations. Profiling shows that line_minimize_with_block_op and calc_grad_with_block_op are the primary hotspots within each iteration, consuming significant CPU time when processing multiple bands.

Since bands are independent of each other and access disjoint memory regions, parallelizing over the band dimension is both safe and efficient.

Changes Made

1. Parallelization Strategy

Both functions are restructured into multi-phase pipelines that separate compute-intensive loops from MPI collective operations:

| Phase | Operation | Parallelization |
|---|---|---|
| Compute | BLAS dot products, normalization, accumulation | #pragma omp parallel for schedule(static) |
| Communication | Parallel_Reduce::reduce_pool() | Serial (batched array reduction) |

2. Key Technical Decisions

Thread Safety

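  • MPI collective operations (MPI_Allreduce via Parallel_Reduce::reduce_pool) are not safe to call from inside a parallel region: concurrent calls require MPI_THREAD_MULTIPLE, and collectives on one communicator must still be issued in the same order on every rank. They are therefore executed serially outside parallel regions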
  • Compute loops are fully parallelized with no shared state between threads

Batched MPI Reduction

  • Original code: N scalar reductions → N MPI calls (N = n_band)
  • Optimized code: 1 array reduction of length N → 1 MPI call per reduction phase
  • Saves the per-call latency of N - 1 MPI calls per phase; the total data volume is unchanged
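The difference in call count can be illustrated with a small sketch. CountingReducer, reduce_per_band, and reduce_batched are hypothetical names introduced here; the reducer only counts calls and stands in for Parallel_Reduce::reduce_pool, whose real signature is not shown in this description.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an MPI-backed reducer. It only counts how often
// it is invoked; a real implementation would call MPI_Allreduce here.
struct CountingReducer {
    int calls = 0;
    void reduce(double* data, std::size_t count) {
        assert(data != nullptr || count == 0);
        (void)data;   // silence unused warnings when assertions are disabled
        (void)count;
        ++calls;
    }
};

// Original pattern: one scalar reduction per band, so N calls.
inline void reduce_per_band(CountingReducer& r, std::vector<double>& norms) {
    for (double& x : norms) {
        r.reduce(&x, 1);
    }
}

// Batched pattern: one reduction over the whole per-band array, so 1 call.
inline void reduce_batched(CountingReducer& r, std::vector<double>& norms) {
    r.reduce(norms.data(), norms.size());
}
```

For a vector of 8 per-band values, reduce_per_band triggers 8 reducer calls while reduce_batched triggers 1. Batching requires that all per-band values are computed before the reduction, which is why the compute loop and the reduction are split into separate phases.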

Static Scheduling

  • schedule(static) is used because each band has equal workload (n_basis operations)
  • Gives each thread one contiguous range of bands, which favors cache locality and keeps scheduling overhead low

Conditional Compilation

  • All OpenMP pragmas are guarded by #ifdef _OPENMP
  • Code compiles and runs correctly when OpenMP is disabled
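A minimal sketch of the guard pattern, assuming a simple per-band scaling loop. scale_bands and max_threads are hypothetical helpers written for this illustration, not functions from bpcg_kernel_op.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>  // only available when the compiler enables OpenMP
#endif

// Scales each band block of length n_basis_max by its own factor.
inline void scale_bands(std::vector<double>& data,
                        const std::vector<double>& factor,
                        int n_basis_max) {
    const int n_band = static_cast<int>(factor.size());
    assert(data.size() == static_cast<std::size_t>(n_band) * n_basis_max);
#ifdef _OPENMP
#pragma omp parallel for schedule(static)
#endif
    for (int band = 0; band < n_band; ++band) {
        double* p = data.data() + static_cast<std::size_t>(band) * n_basis_max;
        for (int i = 0; i < n_basis_max; ++i) {
            p[i] *= factor[band];
        }
    }
}

// Calls into the OpenMP runtime must be guarded, since omp.h may be missing.
inline int max_threads() {
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}
```

Without OpenMP the loop compiles to the plain serial version and max_threads returns 1. The guard matters most for the omp.h include and runtime calls; around the pragma it mainly avoids unknown-pragma warnings.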

3. Memory Access Pattern

Each band accesses a contiguous memory block:

[band_idx * n_basis_max, (band_idx + 1) * n_basis_max)

This ensures:

  • Threads write to disjoint blocks, so false sharing is limited to cache lines that straddle a block boundary
  • Efficient cache utilization
  • Predictable memory access patterns
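The index range above can be written as a small helper. BandRange, band_range, and disjoint are hypothetical names used only for this sketch; it assumes the block stride is n_basis_max, as in the range shown above.

```cpp
#include <cassert>
#include <cstddef>

// Half-open index range [begin, end) owned by one band.
struct BandRange {
    std::size_t begin;
    std::size_t end;
};

// Range of band band_idx when blocks are laid out with stride n_basis_max.
inline BandRange band_range(int band_idx, int n_basis_max) {
    assert(band_idx >= 0 && n_basis_max > 0);
    const std::size_t stride = static_cast<std::size_t>(n_basis_max);
    const std::size_t idx = static_cast<std::size_t>(band_idx);
    return {idx * stride, (idx + 1) * stride};
}

// True when two half-open ranges share no index.
inline bool disjoint(BandRange a, BandRange b) {
    return a.end <= b.begin || b.end <= a.begin;
}
```

With n_basis_max = 16, band 0 owns [0, 16) and band 1 owns [16, 32). Ranges of different bands never overlap, which is what makes the band loop safe to parallelize without locks.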

Performance Impact

Theoretical Speedup

  • Compute-bound sections: at best linear scaling with the number of cores, capped at n_band threads
  • MPI communication: calls per reduction phase drop from O(N) to O(1), where N = n_band
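As a rough bound for the compute phases, assume equal cost per band and p threads under static scheduling (p and S are symbols introduced here). The busiest thread then processes ceil(n_band / p) bands, which gives:

```latex
S(p) \;\le\; \frac{n_{\text{band}}}{\left\lceil n_{\text{band}} / p \right\rceil} \;\le\; \min\left(p,\; n_{\text{band}}\right)
```

For example, n_band = 10 and p = 4 gives S ≤ 10 / 3 ≈ 3.3 rather than 4. The bound ignores the serial MPI phases and memory-bandwidth limits, so measured speedups are expected to be lower.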

Expected Behavior

  • Best case: Near-linear speedup for large n_band on multi-core systems
  • Communication overhead is amortized across all bands

Code Structure

line_minimize_with_block_op<CPU> (5 phases)

  1. Parallel BLAS dot for per-band norms
  2. Batch MPI reduction of norms
  3. Parallel normalization and epsilon accumulation
  4. Batch MPI reduction of epsilons
  5. Parallel rotation application
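The five phases can be sketched roughly as follows. This is a simplified, real-valued illustration and not the code in this PR: line_minimize_sketch and reduce_pool_stub are hypothetical names, the stub replaces Parallel_Reduce::reduce_pool with a single-process no-op, a plain loop replaces the BLAS dot, and the rotation-angle formula built from three inner products is an assumption, since this description only names the phases.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a batched reduce_pool call. With one process the
// sum over ranks equals the local values, so nothing needs to be done.
inline void reduce_pool_stub(double* /*data*/, int /*count*/) {}

// All arrays hold n_band blocks of n_basis_max values; the first n_basis
// entries of each block are active.
inline void line_minimize_sketch(std::vector<double>& grad, std::vector<double>& hgrad,
                                 std::vector<double>& psi, std::vector<double>& hpsi,
                                 int n_basis, int n_basis_max, int n_band) {
    assert(n_basis <= n_basis_max);
    std::vector<double> norm(n_band, 0.0);
    std::vector<double> eps(3 * static_cast<std::size_t>(n_band), 0.0);

    // Phase 1: per-band squared norm of grad (parallel over bands).
#ifdef _OPENMP
#pragma omp parallel for schedule(static)
#endif
    for (int b = 0; b < n_band; ++b) {
        const double* g = grad.data() + static_cast<std::size_t>(b) * n_basis_max;
        double s = 0.0;
        for (int i = 0; i < n_basis; ++i) s += g[i] * g[i];
        norm[b] = s;
    }

    // Phase 2: one batched reduction of all norms (serial).
    reduce_pool_stub(norm.data(), n_band);

    // Phase 3: normalize grad/hgrad and accumulate three inner products.
#ifdef _OPENMP
#pragma omp parallel for schedule(static)
#endif
    for (int b = 0; b < n_band; ++b) {
        const std::size_t off = static_cast<std::size_t>(b) * n_basis_max;
        const double inv = 1.0 / std::sqrt(norm[b]);
        double e0 = 0.0, e1 = 0.0, e2 = 0.0;
        for (int i = 0; i < n_basis; ++i) {
            grad[off + i] *= inv;
            hgrad[off + i] *= inv;
            e0 += psi[off + i] * hpsi[off + i];
            e1 += grad[off + i] * hpsi[off + i];
            e2 += grad[off + i] * hgrad[off + i];
        }
        eps[3 * b] = e0;
        eps[3 * b + 1] = e1;
        eps[3 * b + 2] = e2;
    }

    // Phase 4: one batched reduction of all epsilons (serial).
    reduce_pool_stub(eps.data(), 3 * n_band);

    // Phase 5: rotate psi/hpsi toward grad/hgrad (assumed angle formula).
#ifdef _OPENMP
#pragma omp parallel for schedule(static)
#endif
    for (int b = 0; b < n_band; ++b) {
        const std::size_t off = static_cast<std::size_t>(b) * n_basis_max;
        const double theta =
            0.5 * std::abs(std::atan(2.0 * eps[3 * b + 1] / (eps[3 * b] - eps[3 * b + 2])));
        const double c = std::cos(theta);
        const double s = std::sin(theta);
        for (int i = 0; i < n_basis; ++i) {
            psi[off + i] = c * psi[off + i] + s * grad[off + i];
            hpsi[off + i] = c * hpsi[off + i] + s * hgrad[off + i];
        }
    }
}
```

Each parallel loop writes only to its own band block and its own slots in norm and eps, so the only synchronization points are the two serial reductions between phases.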

calc_grad_with_block_op<CPU> (7 phases)

  1. Parallel BLAS dot for per-band norms
  2. Batch MPI reduction of norms
  3. Parallel normalization and epsilon accumulation
  4. Batch MPI reduction of epsilons
  5. Parallel error and beta computation
  6. Batch MPI reduction of errors and betas
  7. Parallel gradient update and output

Testing

  • Correctness: Results match serial version
  • Thread safety: No data races detected
  • Performance: Benchmarked on multi-core systems
  • Compatibility: Builds with and without OpenMP

Files Modified

  • source/source_hsolver/kernels/bpcg_kernel_op.cpp

Backward Compatibility

  • No API changes
  • No interface modifications
  • Existing code continues to work without modification

Notes

The parallelization follows the same pattern already used in refresh_hcc_scc_vcc_op within the same file, ensuring consistency with existing codebase conventions.
